Matrix Factorization


Roadmap 


(1) Embedding Numerous Features: Kernel Models
(2) Combining Predictive Features: Aggregation Models
(3) Distilling Implicit Features: Extraction Models


Lecture 14: Radial Basis Function Network 


linear aggregation of distance-based similarities 
using k-Means clustering for prototype finding 













Lecture 15: Matrix Factorization 
• Linear Network Hypothesis
• Basic Matrix Factorization
• Stochastic Gradient Descent
• Summary of Extraction Models





Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 




Matrix Factorization Linear Network Hypothesis 


Recommender System Revisited 


data skill 


• data: how ‘many users’ have rated ‘some movies’
• skill: predict how a user would rate an unrated movie





A Hot Problem 
• competition held by Netflix in 2006
• 100,480,507 ratings that 480,189 users gave to 17,770 movies
• 10% improvement = 1 million dollar prize
• data $\mathcal{D}_m$ for m-th movie:
  $\{(\tilde{x}_n = (n),\ y_n = r_{nm}) : \text{user } n \text{ rated movie } m\}$
  (abstract feature $\tilde{x}_n = (n)$)

how to learn our preferences from data?











Binary Vector Encoding of Categorical Feature 


$\tilde{x}_n = (n)$: user IDs, such as 1126, 5566, 6211, ...
(called categorical features)





categorical features, e.g.
• IDs
• blood type: A, B, AB, O
• programming languages: C, C++, Java, Python, ...

many ML models operate on numerical features
• linear models
• extended linear models such as NNet
(except for decision trees)

need: encoding (transform) from categorical to numerical





binary vector encoding:

$A = [1\ 0\ 0\ 0]^T,\quad B = [0\ 1\ 0\ 0]^T,\quad AB = [0\ 0\ 1\ 0]^T,\quad O = [0\ 0\ 0\ 1]^T$
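As a minimal sketch of binary vector encoding for the blood-type example, assuming the category order A, B, AB, O (the helper name `binary_vector_encoding` is hypothetical, chosen only for illustration):

```python
import numpy as np

def binary_vector_encoding(value, categories):
    """Return a vector with a single 1 at the position of `value`."""
    vec = np.zeros(len(categories), dtype=int)
    vec[categories.index(value)] = 1
    return vec

# Assumed category order; any fixed order works as long as it is used consistently.
blood_types = ["A", "B", "AB", "O"]
encoded = {b: binary_vector_encoding(b, blood_types) for b in blood_types}
for b, vec in encoded.items():
    print(b, vec)
```

The same helper applies to user IDs: with N users, each user n maps to an N-dimensional vector with a single 1.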







Feature Extraction from Encoded Vector 
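encoded data $\mathcal{D}_m$ for m-th movie:

$\{(x_n = \text{BinaryVectorEncoding}(n),\ y_n = r_{nm}) : \text{user } n \text{ rated movie } m\}$

or, joint data $\mathcal{D}$

$\{(x_n = \text{BinaryVectorEncoding}(n),\ y_n = [r_{n1}\ ?\ ?\ r_{n4}\ r_{n5}\ \ldots\ r_{nM}]^T)\}$

idea: try feature extraction using $N$-$\tilde{d}$-$M$ NNet without all $x_0^{(\ell)}$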





is tanh necessary? :-) 


‘Linear Network’ Hypothesis 








$\{(x_n = \text{BinaryVectorEncoding}(n),\ y_n = [r_{n1}\ ?\ ?\ r_{n4}\ r_{n5}\ \ldots\ r_{nM}]^T)\}$

• rename: $V^T$ for $[w_{ni}^{(1)}]$ and $W$ for $[w_{im}^{(2)}]$
• hypothesis: $h(x) = W^T V x$
• per-user output: $h(x_n) = W^T v_n$, where $v_n$ is the n-th column of $V$

linear network for recommender system:
learn $V$ and $W$
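A small sketch of why the per-user output reduces to $W^T v_n$: multiplying $V$ by a binary-encoded $x_n$ simply selects the n-th column. The toy sizes and random values below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, d = 5, 3, 2             # users, movies, hidden features (toy sizes)
V = rng.normal(size=(d, N))   # column n holds the user feature v_n
W = rng.normal(size=(d, M))   # column m holds the movie feature w_m

n = 3
x = np.zeros(N)
x[n] = 1.0                    # binary vector encoding of user n

h_full = W.T @ (V @ x)        # h(x) = W^T V x
h_user = W.T @ V[:, n]        # W^T v_n, using only the n-th column of V
same = np.allclose(h_full, h_user)
n_variables = V.size + W.size # number of entries in V plus entries in W
print(same, n_variables)
```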




Fun Time 


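For $N$ users, $M$ movies, and $\tilde{d}$ ‘features’, how many variables need to
be used to specify a linear network hypothesis $h(x) = W^T V x$?

(1) $N + M + \tilde{d}$
(2) $N \cdot M \cdot \tilde{d}$
(3) $(N + M) \cdot \tilde{d}$
(4) $(N \cdot M) + \tilde{d}$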





Reference Answer: (3)

simply $N \cdot \tilde{d}$ for $V^T$ and $\tilde{d} \cdot M$ for $W$
Matrix Factorization Basic Matrix Factorization


Linear Network: Linear Model Per Movie 


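linear network:
$h(x) = W^T \underbrace{V x}_{\Phi(x)}$
for m-th movie, just linear model $h_m(x) = w_m^T \Phi(x)$
subject to shared transform $\Phi$

• for every $\mathcal{D}_m$, want $r_{nm} = y_n \approx w_m^T v_n$
• $E_{\text{in}}$ over all $\mathcal{D}_m$ with squared error measure:

$E_{\text{in}}(\{w_m\}, \{v_n\}) = \frac{1}{\sum_{m=1}^{M} |\mathcal{D}_m|} \sum_{\text{user } n \text{ rated movie } m} (r_{nm} - w_m^T v_n)^2$

linear network: transform and linear models
jointly learned from all $\mathcal{D}_m$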


Matrix Factorization

$r_{nm} \approx w_m^T v_n = v_n^T w_m \iff R \approx V^T W$

[Figure: viewer factors and movie factors are matched to give the predicted rating]

Matrix Factorization Model
learning: known rating
→ learned factors $v_n$ and $w_m$
→ unknown rating prediction

similar modeling can be used for
other abstract features







Matrix Factorization Learning 


$\min_{W, V} E_{\text{in}}(\{w_m\}, \{v_n\}) \propto \sum_{\text{user } n \text{ rated movie } m} (r_{nm} - w_m^T v_n)^2 = \sum_{m=1}^{M} \Big( \sum_{(x_n, r_{nm}) \in \mathcal{D}_m} (r_{nm} - w_m^T v_n)^2 \Big)$

• two sets of variables:
  can consider alternating minimization, remember? :-)
• when $v_n$ fixed, minimizing over $w_m$ ≡ minimizing $E_{\text{in}}$ within $\mathcal{D}_m$:
  simply per-movie (per-$\mathcal{D}_m$) linear regression without $w_0$
• when $w_m$ fixed, minimizing over $v_n$?
  per-user linear regression without $v_0$,
  by symmetry between users/movies

called alternating least squares algorithm


Alternating Least Squares 










(1) initialize $\tilde{d}$-dimensional vectors $\{w_m\}$, $\{v_n\}$
(2) alternating optimization of $E_{\text{in}}$: repeatedly
    (2.1) optimize $w_1, w_2, \ldots, w_M$:
          update $w_m$ by m-th-movie linear regression on $\{(v_n, r_{nm})\}$
    (2.2) optimize $v_1, v_2, \ldots, v_N$:
          update $v_n$ by n-th-user linear regression on $\{(w_m, r_{nm})\}$
    until converge

• initialize: usually just randomly
• converge: guaranteed, since $E_{\text{in}}$ never increases during alternating minimization and is bounded below by 0





alternating least squares: 
the ‘tango’ dance between users/movies 
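The alternation can be sketched in NumPy as below. This is an illustrative sketch under stated assumptions, not a reference implementation: the function name `als`, the toy ratings, the fixed number of rounds in place of a convergence test, and the use of `numpy.linalg.lstsq` for each per-movie and per-user regression are all choices made here.

```python
import numpy as np

def als(ratings, N, M, d=2, n_rounds=20, seed=0):
    """ratings: list of (n, m, r_nm) triples for the known entries only."""
    rng = np.random.default_rng(seed)
    V = rng.normal(scale=0.1, size=(N, d))   # row n is v_n
    W = rng.normal(scale=0.1, size=(M, d))   # row m is w_m
    by_movie = [[(n, r) for (n, mm, r) in ratings if mm == m] for m in range(M)]
    by_user = [[(m, r) for (nn, m, r) in ratings if nn == n] for n in range(N)]
    history = []
    for _ in range(n_rounds):
        # per-movie linear regression on {(v_n, r_nm)}, no constant term
        for m, obs in enumerate(by_movie):
            if obs:
                A = V[[n for n, _ in obs]]
                y = np.array([r for _, r in obs])
                W[m] = np.linalg.lstsq(A, y, rcond=None)[0]
        # per-user linear regression on {(w_m, r_nm)}, no constant term
        for n, obs in enumerate(by_user):
            if obs:
                A = W[[m for m, _ in obs]]
                y = np.array([r for _, r in obs])
                V[n] = np.linalg.lstsq(A, y, rcond=None)[0]
        history.append(sum((r - W[m] @ V[n]) ** 2 for n, m, r in ratings))
    return V, W, history

# Toy data (made up for illustration): 4 users, 3 movies, 8 known ratings.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 2.0), (2, 2, 5.0), (3, 0, 1.0), (3, 2, 4.0)]
V, W, history = als(ratings, N=4, M=3)
print(history[0], history[-1])
```

Each round solves M per-movie problems and then N per-user problems, and the recorded squared error should not increase from one round to the next.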


Linear Autoencoder versus Matrix Factorization 


Linear Autoencoder: $X \approx W (W^T X)$
• motivation: special $d$-$\tilde{d}$-$d$ linear NNet
• error measure: squared on all $x_{ni}$
• solution: global optimal at eigenvectors of $\sum_n x_n x_n^T$ (written $X^T X$ when examples are rows of $X$)
• usefulness: extract dimension-reduced features

Matrix Factorization: $R \approx V^T W$
• motivation: $N$-$\tilde{d}$-$M$ linear NNet
• error measure: squared on known $r_{nm}$
• solution: local optimal via alternating least squares
• usefulness: extract hidden user/movie features

linear autoencoder
≡ special matrix factorization of complete $X$


Fun Time 


How many least squares problems does the alternating least squares
algorithm need to solve in one iteration of alternation?

(1) number of movies $M$
(2) number of users $N$
(3) $M + N$
(4) $M \cdot N$










Reference Answer: (3)

simply $M$ per-movie problems and $N$ per-user problems
Matrix Factorization Stochastic Gradient Descent


Another Possibility: Stochastic Gradient Descent 







$E_{\text{in}}(\{w_m\}, \{v_n\}) \propto \sum_{\text{user } n \text{ rated movie } m} \underbrace{(r_{nm} - w_m^T v_n)^2}_{\text{err(user } n,\ \text{movie } m,\ \text{rating } r_{nm})}$

SGD: randomly pick one example within the $\sum$ &
update with gradient to per-example err, remember? :-)

• ‘efficient’ per iteration
• simple to implement
• easily extends to other err

next: SGD for matrix factorization




Gradient of Per-Example Error Function 


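$\text{err}(\text{user } n, \text{movie } m, \text{rating } r_{nm}) = (r_{nm} - w_m^T v_n)^2$

$\nabla_{v_{1126}}\ \text{err}(\text{user } n, \text{movie } m, \text{rating } r_{nm}) = 0$ unless $n = 1126$
$\nabla_{w_{6211}}\ \text{err}(\text{user } n, \text{movie } m, \text{rating } r_{nm}) = 0$ unless $m = 6211$
$\nabla_{v_n}\ \text{err}(\text{user } n, \text{movie } m, \text{rating } r_{nm}) = -2 (r_{nm} - w_m^T v_n)\, w_m$
$\nabla_{w_m}\ \text{err}(\text{user } n, \text{movie } m, \text{rating } r_{nm}) = -2 (r_{nm} - w_m^T v_n)\, v_n$

per-example gradient
$\propto -(\text{residual})(\text{the other feature vector})$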
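The two non-zero gradients can be sanity-checked numerically. The sketch below compares the closed-form expressions with central finite differences on random toy vectors; the function names and the test values are assumptions for illustration.

```python
import numpy as np

def err(v, w, r):
    """Per-example squared error (r - w^T v)^2."""
    return (r - w @ v) ** 2

def grads(v, w, r):
    """Closed-form gradients: -2 * residual * (the other feature vector)."""
    residual = r - w @ v
    return -2 * residual * w, -2 * residual * v   # (wrt v, wrt w)

rng = np.random.default_rng(1)
v, w, r = rng.normal(size=3), rng.normal(size=3), 4.0
gv, gw = grads(v, w, r)

# Central finite differences along each coordinate direction.
eps = 1e-6
num_gv = np.array([(err(v + eps * e, w, r) - err(v - eps * e, w, r)) / (2 * eps)
                   for e in np.eye(3)])
num_gw = np.array([(err(v, w + eps * e, r) - err(v, w - eps * e, r)) / (2 * eps)
                   for e in np.eye(3)])
print(np.allclose(gv, num_gv, atol=1e-5), np.allclose(gw, num_gw, atol=1e-5))
```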




SGD for Matrix Factorization 


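initialize $\tilde{d}$-dimensional vectors $\{w_m\}$, $\{v_n\}$ randomly
for $t = 0, 1, \ldots, T$
(1) randomly pick $(n, m)$ within all known $r_{nm}$
(2) calculate residual $\tilde{r}_{nm} = (r_{nm} - w_m^T v_n)$
(3) SGD-update:
    $v_n^{\text{new}} \leftarrow v_n^{\text{old}} + \eta \cdot \tilde{r}_{nm}\, w_m^{\text{old}}$
    $w_m^{\text{new}} \leftarrow w_m^{\text{old}} + \eta \cdot \tilde{r}_{nm}\, v_n^{\text{old}}$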





SGD: perhaps most popular large-scale 
matrix factorization algorithm 
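A minimal NumPy sketch of the SGD procedure, assuming a fixed learning rate `eta` (with the factor 2 of the gradient absorbed into it), a fixed iteration count `T`, and made-up toy ratings; the names `sgd_mf` and `squared_error` are hypothetical.

```python
import numpy as np

def squared_error(ratings, V, W):
    return sum((r - W[m] @ V[n]) ** 2 for n, m, r in ratings)

def sgd_mf(ratings, N, M, d=2, eta=0.02, T=20000, seed=0):
    """ratings: list of (n, m, r_nm) triples for the known entries only."""
    rng = np.random.default_rng(seed)
    V = rng.normal(scale=0.1, size=(N, d))   # row n is v_n
    W = rng.normal(scale=0.1, size=(M, d))   # row m is w_m
    for _ in range(T):
        n, m, r = ratings[rng.integers(len(ratings))]  # random known rating
        residual = r - W[m] @ V[n]
        v_old = V[n].copy()                  # both updates use the old vectors
        V[n] += eta * residual * W[m]
        W[m] += eta * residual * v_old
    return V, W

# Toy data (made up for illustration): 4 users, 3 movies, 8 known ratings.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 2.0), (2, 2, 5.0), (3, 0, 1.0), (3, 2, 4.0)]
V0, W0 = sgd_mf(ratings, N=4, M=3, T=0)      # the random initialization only
V, W = sgd_mf(ratings, N=4, M=3)
initial, final = squared_error(ratings, V0, W0), squared_error(ratings, V, W)
print(initial, final)
```

Only the two vectors involved in the picked rating change per step, which is what makes each iteration cheap.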




SGD for Matrix Factorization in Practice 
KDDCup 2011 Track 1: World Champion Solution by NTU 





specialty of data (application need): 
per-user training ratings earlier than test ratings in time 


training/test mismatch: typical sampling bias, remember? :-) 





want: emphasize later examples

last $T'$ iterations of SGD: only those $T'$ examples considered,
so the learned $\{w_m\}$, $\{v_n\}$ favor those

our idea: time-deterministic SGD that visits later examples last
(consistent improvements of test performance)


if you understand the behavior of techniques, 
easier to modify for your real-world use 
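One way to picture "visit later examples last" is a visiting schedule whose final pass is sorted by time. The sketch below is only a guess at the general idea, not the NTU team's actual procedure: the function `visiting_order`, the timestamp field, and the split into random picks followed by one time-ordered pass are all assumptions.

```python
import random

def visiting_order(examples, n_random, seed=0):
    """examples: list of (timestamp, n, m, r). Hypothetical schedule for illustration."""
    rng = random.Random(seed)
    # usual SGD: uniformly random picks among the known ratings
    order = [rng.randrange(len(examples)) for _ in range(n_random)]
    # deterministic final pass in time order, so the latest example comes last
    by_time = sorted(range(len(examples)), key=lambda i: examples[i][0])
    return order + by_time

examples = [(3, 0, 0, 5.0), (1, 0, 1, 3.0), (2, 1, 0, 4.0)]
order = visiting_order(examples, n_random=4)
final_pass_times = [examples[i][0] for i in order[-3:]]
print(final_pass_times)   # [1, 2, 3]
```

Feeding such an order to the SGD update means the last updates come from the most recent ratings, which is the effect the slide describes.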


Fun Time 


If all $w_m$ and $v_n$ are initialized to the 0 vector, what will NOT happen in
SGD for matrix factorization?

(1) all $w_m$ are always 0
(2) all $v_n$ are always 0
(3) every residual $\tilde{r}_{nm}$ = the original rating $r_{nm}$
(4) $E_{\text{in}}$ decreases after each SGD update





Reference Answer: (4)

The 0 feature vectors provide a per-example
gradient of 0 for every example. So $E_{\text{in}}$ cannot
be further decreased.
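The stuck-at-zero behaviour is easy to reproduce. In this sketch (toy ratings and learning rate are assumptions), every update multiplies the residual by a zero vector, so nothing ever moves.

```python
import numpy as np

# Toy data (made up for illustration): 4 users, 3 movies.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 2.0), (2, 2, 5.0), (3, 0, 1.0), (3, 2, 4.0)]
V = np.zeros((4, 2))   # all v_n initialized to the 0 vector
W = np.zeros((3, 2))   # all w_m initialized to the 0 vector
eta = 0.1

residuals = []
for n, m, r in ratings * 10:          # sweep the ratings ten times
    residual = r - W[m] @ V[n]
    residuals.append(residual)
    v_old = V[n].copy()
    V[n] += eta * residual * W[m]     # residual times a zero vector
    W[m] += eta * residual * v_old    # residual times a zero vector

print(np.all(V == 0), np.all(W == 0))   # True True
```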
Matrix Factorization Summary of Extraction Models


Map of Extraction Models 
extraction models: feature transform $\Phi$ as hidden variables
in addition to linear model

Adaptive/Gradient Boosting: hypotheses $g_t$; weights $\alpha_t$
Neural Network/Deep Learning: weights $w_{ij}^{(\ell)}$; weights $w_{ij}^{(L)}$
RBF Network: RBF centers $\mu_m$; weights $\beta_m$
Matrix Factorization: user features $v_n$; movie features $w_m$
k Nearest Neighbor: $x_n$-neighbor RBF; weights $y_n$

extraction models: a rich family


Map of Extraction Techniques 


Adaptive/Gradient Boosting: functional gradient descent
Neural Network/Deep Learning: SGD (backprop); autoencoder
RBF Network: k-means clustering
Matrix Factorization: SGD; alternating leastSQR
k Nearest Neighbor: lazy learning :-)

extraction techniques: quite diverse








Pros and Cons of Extraction Models 


(Neural Network/Deep Learning, RBF Network, Matrix Factorization)

Pros
• ‘easy’: reduces human burden in designing features
• powerful: if enough hidden variables considered

Cons
• ‘hard’: non-convex optimization problems in general
• overfitting: needs proper regularization/validation

be careful when applying extraction models




Fun Time 


Which of the following extraction models extracts Gaussian centers by
k-means and aggregates the Gaussians linearly?

(1) RBF Network
(2) Deep Learning
(3) Adaptive Boosting
(4) Matrix Factorization





Reference Answer: (1) 


Congratulations on being an expert in 
extraction models! :-) 
Summary


(1) Embedding Numerous Features: Kernel Models
(2) Combining Predictive Features: Aggregation Models
(3) Distilling Implicit Features: Extraction Models


Lecture 15: Matrix Factorization 


• Linear Network Hypothesis
  feature extraction from binary vector encoding
• Basic Matrix Factorization
  alternating least squares between user/movie
• Stochastic Gradient Descent
  efficient and easily modified for practical use
• Summary of Extraction Models
  powerful thus need careful use

• next: closing remarks of techniques





